Methods and apparatus to perform speech reference enrollment

ABSTRACT

A speech reference enrollment method involves the following steps: (a) requesting a user speak a vocabulary word; (b) detecting a first utterance (354); (c) requesting the user speak the vocabulary word; (d) detecting a second utterance (358); (e) determining a first similarity between the first utterance and the second utterance (362); (f) when the first similarity is less than a predetermined similarity, requesting the user speak the vocabulary word; (g) detecting a third utterance (366); (h) determining a second similarity between the first utterance and the third utterance (370); and (i) when the second similarity is greater than or equal to the predetermined similarity, creating a reference (364).

This application is a continuation-in-part of the patent application having Ser. No. 08/863,462, filed May 27, 1997, entitled “Method of Accessing a Dial-up Service” and all applications are assigned to the same assignee as the present application.

FIELD OF THE INVENTION

The present invention is related to the field of speech recognition systems and more particularly to a speech reference enrollment method.

BACKGROUND OF THE INVENTION

Both speech recognition and speaker verification applications often use an enrollment process to obtain reference speech patterns for later use. Speech recognition systems that use an enrollment process are generally speaker dependent systems. Both speech recognition systems using an enrollment process and speaker verification systems will be referred to herein as speech reference systems. The performance of speech reference systems is limited by the quality of the reference patterns obtained in the enrollment process. Prior art enrollment processes ask the user to speak the vocabulary word being enrolled and use the extracted features as the reference pattern for the vocabulary word. These systems suffer from unexpected background noise occurring while the user is uttering the vocabulary word during the enrollment process. This unexpected background noise is then incorporated into the reference pattern. Since the unexpected background noise does not occur every time the user utters the vocabulary word, it degrades the speech reference system's ability to match the reference pattern with a subsequent utterance.

Thus there exists a need for an enrollment process for speech reference systems that does not incorporate unexpected background noise in the reference patterns.

SUMMARY OF THE INVENTION

A speech reference enrollment method that overcomes these and other problems involves the following steps: (a) requesting a user speak a vocabulary word; (b) detecting a first utterance; (c) requesting the user speak the vocabulary word; (d) detecting a second utterance; (e) determining a first similarity between the first utterance and the second utterance; (f) when the first similarity is less than a predetermined similarity, requesting the user speak the vocabulary word; (g) detecting a third utterance; (h) determining a second similarity between the first utterance and the third utterance; and (i) when the second similarity is greater than or equal to the predetermined similarity, creating a reference.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an embodiment of a speaker verification system;

FIG. 2 is a flow chart of an embodiment of the steps used to form a speaker verification decision;

FIG. 3 is a flow chart of an embodiment of the steps used to form a code book for a speaker verification decision;

FIG. 4 is a flow chart of an embodiment of the steps used to form a speaker verification decision;

FIG. 5 is a schematic diagram of a dial-up service that incorporates a speaker verification method;

FIG. 6 is a flow chart of an embodiment of the steps used in a dial-up service;

FIG. 7 is a flow chart of an embodiment of the steps used in a dial-up service;

FIG. 8 is a block diagram of a speech reference system using a speech reference enrollment method according to the invention in an intelligent network phone system;

FIGS. 9 a & b are flow charts of an embodiment of the steps used in the speech reference enrollment method;

FIG. 10 is a flow chart of an embodiment of the steps used in an utterance duration check;

FIG. 11 is a flow chart of an embodiment of the steps used in a signal to noise ratio check;

FIG. 12 is a graph of the amplitude of an utterance versus time;

FIG. 13 is a graph of the number of voiced speech frames versus time for an utterance;

FIG. 14 is an amplitude histogram of an utterance; and

FIG. 15 is a block diagram of an automatic gain control circuit.

DETAILED DESCRIPTION OF THE DRAWINGS

A speech reference enrollment method as described herein can be used for both speaker verification methods and speech recognition methods. Several improvements in speaker verification methods that can be used in conjunction with the speech enrollment method are first described. Next, a dial-up service that takes advantage of the enrollment method is described. The speech enrollment method is then described in detail.

FIG. 1 is a block diagram of an embodiment of a speaker verification system 10. It is important to note that the speaker verification system can be physically implemented in a number of ways. For instance, the system can be implemented as software in a general purpose computer connected to a microphone; or the system can be implemented as firmware in a general purpose microprocessor connected to memory and a microphone; or the system can be implemented using a Digital Signal Processor (DSP), a controller, a memory, and a microphone controlled by the appropriate software. Note that since the process can be performed using software in a computer, a computer readable storage medium containing computer readable instructions can be used to implement the speaker verification method. These various system architectures are apparent to those skilled in the art and the particular system architecture selected will depend on the application.

A microphone 12 receives input speech and converts the sound waves to an electrical signal. A feature extractor 14 analyzes the electrical signal and extracts key features of the speech. For instance, the feature extractor first digitizes the electrical signal. A cepstral analysis of the digitized signal is then performed to determine the cepstrum coefficients. In another embodiment, a linear predictive analysis is used to find the linear predictive coding (LPC) coefficients. Other feature extraction techniques are also possible.

A switch 16 is shown attached to the feature extractor 14. This switch 16 represents that a different path is used in the training phase than in the verification phase. In the training phase the cepstrum coefficients are analyzed by a code book generator 18. The output of the code book generator 18 is stored in the code book 20. In one embodiment, the code book generator 18 compares samples of the same utterance from the same speaker to form a generalized representation of the utterance for that person. This generalized representation is a training utterance in the code book. The training utterance represents the generalized cepstrum coefficients of a user speaking the number “one” as an example. A training utterance could also be a part of speech, a phoneme, or a number like “twenty one” or any other segment of speech. In addition to the registered users' samples, utterances are taken from a group of non-users. These utterances are used to form a composite that represents an impostor code having a plurality of impostor references.

In one embodiment, the code book generator 18 segregates the speakers (users and non-users) into male and female groups. The male enrolled references (male group) are aggregated to determine a male variance vector. The female enrolled references (female group) are aggregated to determine a female variance vector. These gender specific variance vectors will be used when calculating a weighted Euclidean distance (measure of closeness) in the verification phase.

In the verification phase the switch 16 connects the feature extractor 14 to the comparator 22. The comparator 22 performs a mathematical analysis of the closeness between a test utterance from a speaker and an enrolled reference stored in the code book 20 and between the test utterance and an impostor reference distribution. In one embodiment, a test utterance such as a spoken “one” is compared with the “one” enrolled reference for the speaker and the “one” impostor reference distribution. The comparator 22 determines a measure of closeness between the “one” enrolled reference, the “one” test utterance and the “one” impostor reference distribution. When the test utterance is closer to the enrolled reference than the impostor reference distribution, the speaker is verified as the true speaker. Otherwise the speaker is determined to be an impostor. In one embodiment, the measure of closeness is a modified weighted Euclidean distance. The modification in one embodiment involves using a generalized variance vector instead of an individual variance vector for each of the registered users. In another embodiment, a male variance vector is used for male speakers and a female variance vector is used for female speakers.
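The comparison performed by the comparator 22 can be sketched as follows. This is only an illustrative sketch: the text does not give the exact form of the modified weighted Euclidean distance, so an inverse-variance weighted squared distance is assumed here, the impostor reference distribution is reduced to a single representative vector, and the function names are hypothetical.

```python
def weighted_euclidean_distance(features, reference, variance):
    """Squared distance with each term divided by a variance (assumed form)."""
    return sum((f - r) ** 2 / v for f, r, v in zip(features, reference, variance))


def preliminary_decision(test, enrolled, impostor, variance):
    """Return True (verified) when the test utterance is closer to the
    enrolled reference than to the impostor reference."""
    d_enrolled = weighted_euclidean_distance(test, enrolled, variance)
    d_impostor = weighted_euclidean_distance(test, impostor, variance)
    return d_enrolled < d_impostor
```

Passing the male or female variance vector as `variance` would correspond to the gender specific embodiment.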

A decision weighting and combining system 24 uses the measure of closeness to determine if the test utterance is closest to the enrolled reference or the impostor reference distribution. When the test utterance is closer to the enrolled reference than the impostor reference distribution, a verified decision is made. When the test utterance is not closer to the enrolled reference than the impostor reference distribution, an un-verified decision is made. These are preliminary decisions. Usually, the speaker is required to speak several utterances (e.g., “one”, “three”, “five”, “twenty one”). A decision is made for each of these test utterances. Each of the plurality of decisions is weighted and combined to form the verification decision.

The decisions are weighted because not all utterances provide equal reliability. For instance, “one” could provide a much more reliable decision than “eight”. As a result, a more accurate verification decision can be formed by first weighting the decisions based on the underlying utterance. Two weighting methods can be used. One weighting method uses a historical approach. Sample utterances are compared to the enrolled references to determine a probability of false alarm $P_{FA}$ (speaker is not impostor but the decision is impostor) and a probability of miss $P_{M}$ (speaker is impostor but the decision is true speaker). The $P_{FA}$ and $P_{M}$ are probabilities of error. These probabilities of error are used to weight each decision. In one embodiment the weighting factors (weight) are described by the equations below:

$$a_{i} = \log\frac{1 - P_{Mi}}{P_{FAi}} \quad \text{Decision is Verified (True Speaker)}$$

$$a_{i} = \log\frac{P_{Mi}}{1 - P_{FAi}} \quad \text{Decision is Not Verified (Impostor)}$$

When the sum of the weighted decisions is greater than zero, then the verification decision is a true speaker. Otherwise the verification decision is an impostor.
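The historical weighting and the sum-greater-than-zero rule can be illustrated with a short sketch. The function names and the tuple layout are assumptions made for illustration; the error probabilities would come from sample utterances as described above.

```python
import math


def decision_weight(verified, p_miss, p_false_alarm):
    """Weight a_i for one preliminary decision, following the two equations."""
    if verified:
        return math.log((1.0 - p_miss) / p_false_alarm)
    return math.log(p_miss / (1.0 - p_false_alarm))


def combine_decisions(decisions):
    """decisions: iterable of (verified, p_miss, p_false_alarm) tuples.
    Returns True (true speaker) when the summed weights exceed zero."""
    total = sum(decision_weight(v, pm, pfa) for v, pm, pfa in decisions)
    return total > 0.0
```

With small error probabilities a verified decision contributes a positive weight and an un-verified decision a negative one, so utterances with lower historical error rates pull the sum harder in either direction.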

The other method of weighting the decisions is based on an immediate evaluation of the quality of the decision. In one embodiment, this is calculated by using a Chi-Squared detector. The decisions are then weighted on the confidence determined by the Chi-Squared detector. In another embodiment, a large sample approximation is used. Thus if the test statistics are $t$, find $b$ such that $\chi^{2}(b) = t$. Then a decision is an impostor if it exceeds the $1-a$ quantile of the $\chi^{2}$ distribution. One weighting scheme is shown below:

1.5, if $b > c_{accept}$

1.0, if $1-a \le b \le c_{accept}$

−1.0, if $c_{reject} \le b \le 1-a$

−1.25, if $b < c_{reject}$

When the sum of the weighted decisions is greater than zero, then the verification decision is a true speaker. When the sum of the weighted decisions is less than or equal to zero, the decision is an impostor.
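The four-level weighting scheme can be read as a simple piecewise function, sketched below. This sketch takes the listed ranges literally and makes two assumptions that the text does not state: the thresholds are ordered so that c_reject is below the 1-a level, which is below c_accept, and a value of b exactly equal to the 1-a level (where the two middle ranges overlap) is given the positive weight. All names are hypothetical.

```python
def confidence_weight(b, one_minus_a, c_accept, c_reject):
    """Map the confidence value b to one of the four listed weights.
    Assumes c_reject <= one_minus_a <= c_accept."""
    if b > c_accept:
        return 1.5
    if b >= one_minus_a:   # boundary case resolved to the positive weight (assumption)
        return 1.0
    if b >= c_reject:
        return -1.0
    return -1.25


def verify_from_confidences(values, one_minus_a, c_accept, c_reject):
    """True speaker when the summed weights are greater than zero."""
    total = sum(confidence_weight(b, one_minus_a, c_accept, c_reject) for b in values)
    return total > 0.0
```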

In another embodiment, the feature extractor 14 segments the speech signal into voiced sounds and unvoiced sounds. Voiced sounds generally include vowels, while most other sounds are unvoiced. The unvoiced sounds are discarded before the cepstrum coefficients are calculated in both the training phase and the verification phase.

These techniques of weighting the decisions, using gender dependent cepstrums and only using voiced sounds can be combined or used separately in a speaker verification system.

FIG. 2 is a flow chart of an embodiment of the steps used to form a speaker verification decision. The process starts, at step 40, by generating a code book at step 42. The code book has a plurality of enrolled references for each of the plurality of speakers (registered users, plurality of people) and a plurality of impostor references. The enrolled references in one embodiment are the cepstrum coefficients for a particular user speaking a particular utterance (e.g., “one”). The enrolled references are generated by a user speaking the utterances. The cepstrum coefficients of each of the utterances are determined to form the enrolled references. In one embodiment a speaker is asked to repeat the utterance and a generalization of the two utterances is saved as the enrolled reference. In another embodiment both utterances are saved as enrolled references.

In one embodiment, a data base of male speakers is used to determine a male variance vector and a data base of female speakers is used to determine a female variance vector. In another embodiment, the data bases of male and female speakers are used to form a male impostor code book and a female impostor code book. The gender specific variance vectors are stored in the code book. At step 44, a plurality of test utterances (input set of utterances) from a speaker are received. In one embodiment the cepstrum coefficients of the test utterances are calculated. Each of the plurality of test utterances is compared to the plurality of enrolled references for the speaker at step 46. Based on the comparison, a plurality of decisions are formed, one for each of the plurality of enrolled references. In one embodiment, the comparison is determined by a Euclidean weighted distance between the test utterance and the enrolled reference and between the test utterance and an impostor reference distribution. In another embodiment, the Euclidean weighted distance is calculated with the male variance vector if the speaker is a male or the female variance vector if the speaker is a female. Each of the plurality of decisions is weighted to form a plurality of weighted decisions at step 48. The weighting can be based on historical error rates for the utterance or based on a confidence level (confidence measure) of the decision for the utterance. The plurality of weighted decisions are combined at step 50. In one embodiment the step of combining involves summing the weighted decisions. A verification decision is then made based on the combined weighted decisions at step 52, ending the process at step 54. In one embodiment if the sum is greater than zero, the verification decision is that the speaker is a true speaker; otherwise the speaker is an impostor.

FIG. 3 is a flow chart of an embodiment of the steps used to form a code book for a speaker verification decision. The process starts, at step 70, by receiving an input utterance at step 72. In one embodiment, the input utterances are then segmented into voiced sounds and unvoiced sounds at step 74. The cepstrum coefficients are then calculated using the voiced sounds at step 76. The coefficients are stored as an enrolled reference for the speaker at step 78. The process then returns to step 72 for the next input utterance, until all the enrolled references have been stored in the code book.

FIG. 4 is a flow chart of an embodiment of the steps used to form a speaker verification decision. The process starts, at step 100, by receiving input utterances at step 102. Next, it is determined if the speaker is male or female at step 104. In a speaker verification application, the speaker purports to be someone in particular. If the person purports to be someone that is a male, then the speaker is assumed to be male even if the speaker is a female. The input utterances are then segmented into voiced sounds and unvoiced sounds at step 106. Features (e.g., cepstrum coefficients) are extracted from the voiced sounds to form the test utterances, at step 108. At step 110, the weighted Euclidean distance (WED) is calculated using a generalized male variance vector if the purported speaker is a male. When the purported speaker is a female, the female variance vector is used. The WED is calculated between the test utterance and the enrolled reference for the speaker and between the test utterance and the male (or female if appropriate) impostor reference distribution. A decision is formed for each test utterance based on the WED at step 112. The decisions are then weighted based on a confidence level (measure of confidence) determined using a Chi-squared detector at step 114. The weighted decisions are summed at step 116. A verification decision is made based on the sum of the weighted decisions at step 118.

Using the speaker verification decisions discussed above results in an improved speaker verification system that is more reliable than present techniques.

A dial-up service that uses a speaker verification method as described above is shown in FIG. 5. The dial-up service is shown as a banking service. A user dials a service number on their telephone 150. The public switched telephone network (PSTN) 152 then connects the user's phone 150 with a dial-up service computer 154 at a bank 156. The dial-up service need not be located within a bank. The service will be explained in conjunction with the flow chart shown in FIG. 6. The process starts, at step 170, by dialing a service number (communication service address, number) at step 172. The user (requester) is then prompted by the computer 154 to speak a plurality of digits (access code, plurality of numbers, access number) to form a first utterance (first digitized utterance) at step 174. The digits are recognized using speaker independent voice recognition at step 176. When the user has used the dial-up service previously, the user is verified based on the first utterance at step 178. When the user is verified as a true speaker at step 178, access to the dial-up service is allowed at step 180. When the user cannot be verified, the user is requested to input a personal identification number (PIN) at step 182. The PIN can be entered by the user either by speaking the PIN or by entering the PIN on a keypad. At step 184 it is determined if the PIN is valid. When the PIN is not valid, the user is denied access at step 186. When the PIN is valid the user is allowed access to the service at step 180. Using the above method the dial-up service uses a speaker verification system as a PIN option, but does not deny access to the user if it cannot verify the user.
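The access decision of FIG. 6 for a returning user can be sketched as a small function. The three callables are hypothetical stand-ins for the speaker verification step, the PIN prompt, and the PIN check; they are not part of the described system. The sketch only shows that the PIN is requested solely when verification fails.

```python
def dial_up_access(verify_speaker, request_pin, pin_is_valid):
    """Return True when access is allowed, False when it is denied."""
    if verify_speaker():              # step 178: verified as true speaker
        return True                   # step 180: allow access
    pin = request_pin()               # step 182: fall back to a PIN
    return pin_is_valid(pin)          # steps 184-186: allow or deny
```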

FIG. 7 is a flow chart of another embodiment of the steps used in a dial-up service. The process starts, step 200, by the user speaking an access code to form a plurality of utterances at step 202. At step 204 it is determined if the user has previously accessed the service. When the user has previously used the service, the speaker verification system attempts to verify the user (identity) at step 206. When the speaker verification system can verify the user, the user is allowed access to the system at step 208. When the system cannot verify the user, a PIN is requested at step 210. Note the user can either speak the PIN or enter the PIN on a keypad. At step 212 it is determined if the PIN is valid. When the PIN is not valid the user is denied access at step 214. When the PIN is valid, the user is allowed access at step 208.

When the user has not previously accessed the communication service at step 204, the user is requested to enter a PIN at step 216. At step 218 it is determined if the PIN is valid. When the PIN is not valid, access to the service is denied at step 220. When the PIN is valid the user is asked to speak the access code a second time to form a second utterance (plurality of second utterances, second digitized utterance) at step 222. The similarity between the first utterance (step 202) and the second utterance is compared to a threshold at step 224. In one embodiment the similarity is calculated using a weighted Euclidean distance. When the similarity is less than or equal to the threshold, the user is asked to speak the access code again at step 222. In this case the second and third utterances would be compared for the required similarity. In practice, the user would not be required to repeat the access code at step 222 more than once or twice and the system would then allow the user access. When the similarity is greater than the threshold, a combination of the two utterances is stored as an enrolled reference at step 226. In another embodiment both utterances are stored as enrolled references. Next access to the service is allowed at step 208. The enrolled reference is used to verify the user the next time they access the service. Note that the speaker verification part of the access to the dial-up service in one embodiment uses all the techniques discussed for a verification process. In another embodiment the verification process only uses one of the speaker verification techniques. Finally, in another embodiment the access number has a predetermined digit that is selected from a first set of digits (predefined set of digits) if the user is a male. When the user is a female, the predetermined digit is selected from a second set of digits. This allows the system to determine if the user is supposed to be a male or a female. Based on this information, the male variance vector or female variance vector is used in the speaker verification process.

FIG. 8 is a block diagram of a speech reference system 300 using a speech reference enrollment method according to the invention in an intelligent network phone system 302. The speech reference system 300 can perform speech recognition or speaker verification. The speech reference system 300 is implemented in a service node or intelligent peripheral (SN/IP). When the speech reference system 300 is implemented in a service node, it is directly connected to a telephone central office/service switching point (CO/SSP) 304-308. The central office/service switching points 304-308 are connected to a plurality of telephones 310-320. When the speech reference system 300 is implemented in an intelligent peripheral, it is connected to a service control point (SCP) 322. In this scheme a call from one of the plurality of telephones 310-320 invoking a special feature, such as speech recognition, requires processing by the service control point 322. Calls requiring special processing are detected at CO/SSP 304-308. This triggers the CO/SSP 304-308 to interrupt call processing while the CO/SSP 304-308 transmits a query to the SCP 322, requesting information to recognize a word spoken by the user. The query is carried over a signal system 7 (SS7) link 324 and routed to the appropriate SCP 322 by a signal transfer point (STP) 326. The SCP 322 sends a request for the intelligent peripheral 300 to perform speech recognition. The speech reference system 300 can be implemented using a computer capable of reading and executing computer readable instructions stored on a computer readable storage medium 328. The instructions on the storage medium 328 instruct the computer how to perform the enrollment method according to the invention.

FIGS. 9 a & b are flow charts of the speech reference enrollment method. This method can be used with any speech reference system, including those used as part of an intelligent telephone network as shown in FIG. 8. The enrollment process starts, step 350, by receiving a first utterance of a vocabulary word from a user at step 352. Next, a plurality of features are extracted from the first utterance at step 354. In one embodiment, the plurality of features are the cepstrum coefficients of the utterance. At step 356, a second utterance is received. In one embodiment the first utterance and the second utterance are received in response to a request that the user speak the vocabulary word. Next, the plurality of features are extracted from the second utterance at step 358. Note that the same features are extracted for both utterances. At step 360, a first similarity is determined between the plurality of features from the first utterance and the plurality of features from the second utterance. In one embodiment, the similarity is determined using a hidden Markov model Viterbi scoring system. Then it is determined if the first similarity is less than a predetermined similarity at step 362. When the first similarity is not less than the predetermined similarity, then a reference pattern (reference utterance) of the vocabulary word is formed at step 364. The reference pattern, in one embodiment, is an averaging of the features from the first and second utterance. In another embodiment, the reference pattern consists of storing the features from both the first utterance and the second utterance, with a pointer from both to the vocabulary word.

When the first similarity is less than the predetermined similarity, then a third utterance (third digitized utterance) is received and the plurality of features from the third utterance are extracted at step 366. Generally, the utterance would be received based on a request by the system. At step 368, a second similarity is determined between the features from the first utterance and the third utterance. The second similarity is calculated using the same function as the first similarity. Next, it is determined if the second similarity is greater than or equal to the predetermined similarity at step 370. When the second similarity is greater than or equal to the predetermined similarity, a reference is formed at step 364. When the second similarity is not greater than or equal to the predetermined similarity, then a third similarity is calculated between the features from the second utterance and the third utterance at step 372. Next, it is determined if the third similarity is greater than or equal to the predetermined similarity at step 374. When the third similarity is greater than or equal to the predetermined similarity, a reference is formed at step 376. When the third similarity is not greater than or equal to the predetermined similarity, the enrollment process is started over at step 378. Using this method the enrollment process avoids incorporating unexpected noise or other abnormalities into the reference pattern.
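The enrollment flow of FIGS. 9 a & b can be sketched as follows. This is a minimal sketch under stated assumptions: a negative squared distance stands in for the hidden Markov model Viterbi score mentioned in the text, the reference is formed by averaging (one of the two described embodiments), and the `max_rounds` limit on restarts is an addition for illustration that the text does not specify. All function names are hypothetical.

```python
def neg_sq_distance(a, b):
    """Stand-in similarity: larger (closer to zero) means more similar."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))


def make_reference(a, b):
    """Average the features of two matching utterances."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def enroll(get_features, similarity, threshold, max_rounds=3):
    """Return a reference pattern, or None if no pair of utterances matched."""
    for _ in range(max_rounds):
        first = get_features()                        # steps 352-354
        second = get_features()                       # steps 356-358
        if similarity(first, second) >= threshold:    # steps 360-362
            return make_reference(first, second)      # step 364
        third = get_features()                        # step 366
        if similarity(first, third) >= threshold:     # steps 368-370
            return make_reference(first, third)
        if similarity(second, third) >= threshold:    # steps 372-374
            return make_reference(second, third)      # step 376
        # step 378: no pair matched, start the enrollment over
    return None
```

Because a reference is only formed from two utterances that agree with each other, a single utterance corrupted by a noise burst is left out of the reference pattern.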

In one embodiment of the speech reference enrollment method of FIGS. 9 a & b, a duration check is performed for each of the utterances. The duration check increases the chance that background noise will not be considered to be the utterance or part of an utterance. A flow chart of the duration check is shown in FIG. 10. The process starts, step 400, by determining the duration of the utterance at step 402. Next, it is determined if the duration is less than a minimum duration at step 404. When the duration is less than the minimum duration, the utterance is disregarded at step 406. In one embodiment, the user is then requested to speak the vocabulary word again and the process is started over. When the duration is not less than the minimum duration, it is determined if the duration is greater than a maximum duration at step 408. When the duration is greater than a maximum duration, the utterance is disregarded at step 406. When the duration is not greater than the maximum duration, the utterance is kept for further processing at step 410.
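The duration check of FIG. 10 reduces to a range test, sketched below. The default minimum and maximum durations are placeholder values chosen only for illustration; the text does not give numbers.

```python
def keep_utterance(duration_s, min_duration_s=0.2, max_duration_s=2.0):
    """Return True to keep the utterance (step 410), False to disregard it (step 406)."""
    if duration_s < min_duration_s:   # step 404: too short, likely a noise burst
        return False
    if duration_s > max_duration_s:   # step 408: too long, likely includes background noise
        return False
    return True
```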

Another embodiment of the speech reference enrollment method checks if the signal to noise ratio is adequate for each utterance. This reduces the likelihood that a noisy utterance will be stored as a reference pattern. The method is shown in the flow chart of FIG. 11. The process starts, step 420, by receiving an utterance at step 422. Next, the signal to noise ratio is determined at step 424. At step 426, it is determined if the signal to noise ratio is greater than a threshold (predetermined signal to noise ratio). When the signal to noise ratio is greater than the threshold, then the utterance is processed at step 428. When the signal to noise ratio is not greater than the threshold, another utterance is requested at step 430.
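A sketch of the signal to noise ratio check of FIG. 11 follows. The text does not say how the ratio is determined, so this sketch assumes it is the ratio of mean power in the utterance samples to mean power in a separate set of noise-only samples (for example, taken before the utterance starts), expressed in decibels. The threshold value and all names are hypothetical.

```python
import math


def mean_power(samples):
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech_samples, noise_samples):
    """Signal to noise ratio in dB (assumed definition)."""
    return 10.0 * math.log10(mean_power(speech_samples) / mean_power(noise_samples))


def process_utterance(speech_samples, noise_samples, threshold_db=15.0):
    """True: process the utterance (step 428). False: request another (step 430)."""
    return snr_db(speech_samples, noise_samples) > threshold_db
```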

FIG. 12 is a graph 450 of the amplitude of an utterance versus time and shows one embodiment of how the duration of the utterance is determined. The speech reference system requests the user speak a vocabulary word, which begins the response period (utterance period) 452. The response period ends at a timeout (timeout period) 454 if no utterance is detected. The amplitude is monitored and when it crosses above an amplitude threshold 456 it is assumed that the utterance has started (start time) 458. When the amplitude of the utterance falls below the threshold, it is marked as the end time 460. The duration is calculated as the difference between the end time 460 and the start time 458.
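The threshold-crossing measurement of FIG. 12 can be sketched as a single pass over amplitude values. The sketch assumes the input is an amplitude envelope sampled at a fixed period (not raw oscillating samples) and returns None when no complete utterance is found, which would correspond to the timeout case. The function name and parameters are hypothetical.

```python
def utterance_duration(envelope, sample_period_s, threshold):
    """Return end time minus start time in seconds, or None if not found."""
    start = None
    for i, amplitude in enumerate(envelope):
        if start is None:
            if amplitude > threshold:      # start time 458
                start = i
        elif amplitude < threshold:        # end time 460
            return (i - start) * sample_period_s
    return None
```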

In another embodiment of the invention, the number (count) of voiced speech frames that occur during the response period or between a start time and an end time is determined. The response period is divided into a number of frames, generally 20 ms long, and each frame is characterized either as an unvoiced frame or a voiced frame. FIG. 13 shows a graph 470 of the estimate of the number of the voiced speech frames 472 during the response period. When the estimate of the number of voiced speech frames exceeds a threshold (predetermined number of voiced speech frames), then it is determined that a valid utterance was received. When the number of voiced speech frames does not exceed the threshold, then it is likely that noise was received instead of a valid utterance.
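The voiced frame count can be sketched as below. The text does not say how a frame is characterized as voiced or unvoiced, so this sketch assumes a common heuristic: a frame is treated as voiced when its mean energy is high and its zero-crossing count is low. The thresholds and function names are hypothetical; at an 8 kHz sampling rate a 20 ms frame would be 160 samples.

```python
def split_frames(samples, frame_len):
    """Split samples into consecutive non-overlapping full frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def is_voiced(frame, energy_threshold, max_zero_crossings):
    """Assumed heuristic: high energy and few sign changes."""
    energy = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return energy > energy_threshold and crossings <= max_zero_crossings


def valid_utterance(samples, frame_len, min_voiced_frames,
                    energy_threshold, max_zero_crossings):
    """True when the voiced frame count exceeds the predetermined number."""
    voiced = sum(1 for f in split_frames(samples, frame_len)
                 if is_voiced(f, energy_threshold, max_zero_crossings))
    return voiced > min_voiced_frames
```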

In another embodiment an amplitude histogram of the utterance is computed. FIG. 14 is an amplitude histogram 480 of an utterance. The amplitude histogram 480 measures the number of samples in each bit of amplitude from the digitizer. When a particular bit 482 has no or very few samples, the system generates a warning message that a problem may exist with the digitizer. A poorly performing digitizer can degrade the performance of the speech reference system.

In another embodiment, an automatic gain control circuit is used to adjust the amplifier gain before the features are extracted from the utterance. FIG. 15 is a block diagram of an automatic gain control circuit 500. The circuit 500 also includes some logic to determine if the utterance should be kept for processing or another utterance should be requested. An adjustable gain amplifier 502 has an input coupled to an utterance signal line (input signal) 504. The output 506 of the amplifier 502 is connected to a signal to noise ratio meter 508. The output 510 of the signal to noise ratio meter 508 is coupled to a comparator 512. The comparator 512 determines if the signal to noise ratio is greater than a threshold signal to noise ratio 514. When the signal to noise ratio is less than the threshold a logical one is output from the comparator 512. The output 513 of the comparator 512 is coupled to an OR gate 514 and to an increase gain input 516 of the adjustable gain amplifier 502. When the output 513 is a logical one, the gain of the amplifier 502 is increased by an incremental step.

The output 506 of the amplifier 502 is connected to a signal line 518 leading to the feature extractor. In addition, the output 506 is connected to an amplitude comparator 520. The comparator 520 determines if the output 506 exceeds a saturation threshold 522. The output 524 of the comparator 520 is connected to the OR gate 514 and a decrease gain input 526 of the amplifier 502. When the output 506 exceeds the saturation threshold 522, the comparator 520 outputs a logical one that causes the amplifier 502 to reduce its gain by an incremental step. The output of the OR gate 514 is a disregard utterance signal line 528. When the output of the OR gate is a logical one the utterance is disregarded. The circuit reduces the chances of receiving a poor representation of the utterance due to incorrect gain of the input amplifier.
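The decision logic of the automatic gain control circuit 500 can be mirrored in software as a single update step. This is an illustrative sketch of the logic only, not of the circuit itself; the gain step size, the units, and the function name are assumptions.

```python
def agc_step(gain, snr_db, output_peak, snr_threshold_db,
             saturation_threshold, step=1.0):
    """Return (new_gain, disregard_utterance) for one utterance."""
    low_snr = snr_db < snr_threshold_db              # comparator 512 outputs a logical one
    saturated = output_peak > saturation_threshold   # comparator 520 outputs a logical one
    if low_snr:
        gain += step                                 # increase gain input 516
    if saturated:
        gain -= step                                 # decrease gain input 526
    disregard = low_snr or saturated                 # OR gate, disregard line 528
    return gain, disregard
```

An utterance is kept only when neither comparator fires, so each gain adjustment takes effect on the next utterance that is requested.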

Thus there has been described a speech reference enrollment method that significantly reduces the chances of using a poor utterance for forming a reference pattern. While the invention has been described in conjunction with specific embodiments thereof, it is evident that many alterations, modifications, and variations will be apparent to those skilled in the art in light of the foregoing description. Accordingly, it is intended to embrace all such alterations, modifications, and variations in the appended claims.

1-22. (canceled)
 23. A method, comprising: receiving a first utterance of a word; receiving a second utterance of the word; when a number of voiced speech frames associated with the second utterance is greater than a threshold, determining a first similarity between the first utterance and the second utterance; and when the first similarity is greater than or equal to a similarity threshold, storing a reference for the word.
 24. A method as defined in claim 23, wherein determining the first similarity between the first utterance and the second utterance comprises determining a similarity between a first plurality of features associated with the first utterance and a second plurality of features associated with the second utterance.
 25. A method as defined in claim 23, further comprising: when the first similarity is less than the similarity threshold, requesting a user to speak a third utterance of the word; determining a second similarity between the first utterance and the third utterance; and when the second similarity is greater than or equal to the similarity threshold, storing the reference for the word.
 26. A method as defined in claim 23, further comprising determining the number of voiced speech frames by estimating the number of voiced speech frames.
 27. A method as defined in claim 23, further comprising requesting the user repeat the word when the number of voiced speech frames is less than the threshold.
 28. A method as defined in claim 23, further comprising: determining a signal to noise ratio of the first utterance; and when the signal to noise ratio is less than a threshold signal to noise ratio, increasing a gain of a voice amplifier.
 29. A method as defined in claim 23, further comprising determining an amplitude histogram of the first utterance.
 30. A method as defined in claim 23, further comprising retrieving the reference and verifying a speaker based on the reference.
 31. A method, comprising: receiving a first utterance of a word; receiving a second utterance of the word; when the duration of the second utterance is greater than or equal to a first duration or less than or equal to a second duration, storing a reference for the word.
 32. A method as defined in claim 31, further comprising: when the duration of the second utterance is greater than or equal to the first duration or less than or equal to the second duration, determining a first similarity between the first utterance and the second utterance; and when the first similarity is greater than or equal to a similarity threshold, storing the reference.
 33. A method as defined in claim 32, further comprising: when the first similarity is less than the similarity threshold, requesting the user speak the word; determining a second similarity between the first utterance and a third utterance; and when the second similarity is greater than or equal to the similarity threshold, storing the reference.
 34. A method as defined in claim 33, further comprising: determining a third similarity between the second utterance and the third utterance; and when the third similarity is greater than or equal to the similarity threshold, storing the reference.
 35. A method as defined in claim 31, further comprising: determining if the first utterance exceeds an amplitude threshold within a time period; and when the first utterance does not exceed the amplitude threshold within the time period, requesting the user re-speak the word.
 36. A method as defined in claim 31, further comprising: associating a start time with a point at which a first amplitude of the first utterance is greater than an amplitude threshold; associating an end time with a point at which a second amplitude of the first utterance is less than the amplitude threshold; and determining the duration as a difference between the end time and the start time.
 37. A method as defined in claim 35, further comprising: determining a signal to noise ratio of the first utterance; and when the signal to noise ratio is less than a threshold signal to noise ratio, increasing a gain of a voice amplifier.
 38. A method as defined in claim 31, further comprising retrieving the reference and verifying a speaker based on the reference.
 39. A machine accessible storage medium having instructions stored thereon that, when executed, cause a machine to: receive a first utterance of a word; receive a second utterance of the word; when the duration of the second utterance is greater than or equal to a first duration or less than or equal to a second duration, store a reference for the word.
 40. A machine accessible storage medium as defined in claim 39 having instructions stored thereon that, when executed, cause the machine to: when the duration of the second utterance is greater than or equal to the first duration or less than or equal to the second duration: determine a first similarity between the first utterance and the second utterance; and when the first similarity is greater than or equal to a similarity threshold, store the reference.
 41. A machine accessible storage medium as defined in claim 40 having instructions stored thereon that, when executed, cause the machine to: when the first similarity is less than the similarity threshold: request the user speak the word; determine a second similarity between the first utterance and a third utterance; and when the second similarity is greater than or equal to the similarity threshold, store the reference.
 42. A machine accessible storage medium as defined in claim 41 having instructions stored thereon that, when executed, cause the machine to: determine a third similarity between the second utterance and the third utterance; and when the third similarity is greater than or equal to the predetermined similarity, store the reference.
 43. A machine accessible storage medium as defined in claim 39 having instructions stored thereon that, when executed, cause the machine to: determine if the first utterance exceeds an amplitude threshold within a time period; and when the first utterance does not exceed the amplitude threshold within the time period, request the user re-speak the word.
 44. A machine accessible storage medium as defined in claim 39 having instructions stored thereon that, when executed, cause the machine to: associate a start time with a point at which a first amplitude of the first utterance is greater than an amplitude threshold; associate an end time with a point at which a second amplitude of the first utterance is less than the amplitude threshold; and determine the duration as a difference between the end time and the start time.
 45. A machine accessible storage medium as defined in claim 39 having instructions stored thereon that, when executed, cause the machine to: determine a signal to noise ratio of the first utterance; and when the signal to noise ratio is less than a threshold signal to noise ratio, increase a gain of a voice amplifier.
 46. A machine accessible storage medium as defined in claim 39 having instructions stored thereon that, when executed, cause the machine to determine an amplitude histogram of the first utterance.
 47. A machine accessible storage medium as defined in claim 39 having instructions stored thereon that, when executed, cause the machine to retrieve the reference and verify a speaker based on the reference.